Minimax Value Interval for Off-Policy Evaluation and Policy Optimization
We study minimax methods for off-policy evaluation (OPE) using value functions and marginalized importance weights. Although these methods hold the promise of overcoming the exponential variance of traditional importance sampling, several key problems remain: (1) They require function approximation and are generally biased. For the sake of trustworthy OPE, is there any way to quantify the biases? (2) They come in two styles ("weight-learning" vs. "value-learning"). Can the two styles be unified? In this paper we answer both questions positively. By slightly altering the derivation of previous methods (one from each style), we unify them into a single value interval that comes with a special type of double robustness: when either the value-function or the importance-weight class is well specified, the interval is valid and its length quantifies the misspecification of the other class.
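To make the interval concrete, below is a minimal brute-force sketch in a tabular setting. All names (L_hat, value_interval, the toy data) are hypothetical, and the code only illustrates the generic minimax shape under the standard normalized objective; the paper's exact endpoint assignment follows from its own derivation, and, as the reviews below emphasize, statistical estimation error is deliberately ignored.

```python
import numpy as np

# A minimal tabular sketch (hypothetical names, not the paper's code).
# L below is the standard minimax OPE objective shared by the
# value-learning and weight-learning styles:
#   L(q, w) = (1 - gamma) * E_{s0 ~ d0}[ q(s0, pi) ]
#           + E_D[ w(s, a) * ( r + gamma * q(s', pi) - q(s, a) ) ]
# In exact expectation, L(q, w) = J(pi) for every q when w is the true
# density ratio d^pi / d^D, and for every w when q = Q^pi.

def L_hat(q, w, data, init_states, pi, gamma):
    """Empirical minimax objective. q, w: (S, A) arrays; pi: (S, A) policy matrix."""
    init_term = (1.0 - gamma) * np.mean([pi[s0] @ q[s0] for s0 in init_states])
    bellman_term = np.mean([
        w[s, a] * (r + gamma * (pi[s2] @ q[s2]) - q[s, a])
        for (s, a, r, s2) in data
    ])
    return init_term + bellman_term

def value_interval(Q_class, W_class, data, init_states, pi, gamma):
    """Interval from the two minimax expressions over small finite classes.
    Ignoring estimation error, J(pi) lies between the two expressions
    whenever Q_class contains Q^pi or W_class contains d^pi / d^D."""
    e1 = max(min(L_hat(q, w, data, init_states, pi, gamma) for w in W_class)
             for q in Q_class)
    e2 = min(max(L_hat(q, w, data, init_states, pi, gamma) for w in W_class)
             for q in Q_class)
    return min(e1, e2), max(e1, e2)

# Toy usage: 2 states, 2 actions, random candidate classes.
rng = np.random.default_rng(0)
pi = np.array([[0.9, 0.1], [0.2, 0.8]])                   # target policy pi(a|s)
data = [(0, 0, 1.0, 1), (1, 1, 0.0, 0), (0, 1, 0.5, 0)]   # (s, a, r, s') tuples
Q_class = [rng.uniform(0.0, 2.0, size=(2, 2)) for _ in range(20)]
W_class = [rng.uniform(0.5, 1.5, size=(2, 2)) for _ in range(20)]
lo, hi = value_interval(Q_class, W_class, data, [0, 1], pi, gamma=0.9)
print(f"value interval: [{lo:.3f}, {hi:.3f}]")
```

The double robustness shows up in the comments on L_hat: realizability of either class pins L to J(pi) in one of its arguments, so optimizing over the other class can only tighten, not invalidate, the resulting interval.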
Review for NeurIPS paper: Minimax Value Interval for Off-Policy Evaluation and Policy Optimization
Weaknesses: The study of the bias issue is important, but I am not fully convinced by the motivation for this so-called "confidence interval". Normally a confidence interval is designed for uncertainty quantification and is thus of great practical interest. However, since the authors explicitly point out that they do not consider uncertainty, this rules out all the important applications that a typical CI supports (safe RL, etc.), because the proposed interval will not be valid in practice due to estimation error. Thus, I can only view the contribution of this paper as a sort of additional guarantee for the algorithms proposed in "Minimax Weight and Q-Function Learning for Off-Policy Evaluation", since the algorithms are the same. Solely quantifying the bias of an existing estimator may not be viewed as a sufficiently significant contribution.
Meta-review for NeurIPS paper: Minimax Value Interval for Off-Policy Evaluation and Policy Optimization
The paper provides a very general minimax framework for quantifying the bias/approximation error in off-policy evaluation, and the results apply to a range of OPE methods. Reviewers generally agree that this is a good paper with a clear contribution. One direction for improvement would be to quantify the statistical noise in off-policy evaluation, which is nontrivial but extremely important; reviewers, the AC, and the SAC agree that such analysis can be left for future work. We also strongly suggest that the authors rephrase or explain the wording "confidence interval".